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Abstract 

Recent contributions have framed linear system identification as a nonparametric regularized inverse problem. Relying on £ 2 - 
type regularization which accounts for the stability and smoothness of the impulse response to be estimated, these approaches 
have been shown to be competitive w.r.t classical parametric methods. In this paper, adopting Maximum Entropy arguments, 
we derive a new £2 penalty deriving from a vector-valued kernel; to do so we exploit the structure of the Hankel matrix, thus 
controlling at the same time complexity, measured by the McMillan degree, stability and smoothness of the identified models. 
As a special case we recover the nuclear norm penalty on the squared block Hankel matrix. In contrast with previous literature 
on reweighted nuclear norm penalties, our kernel is described by a small number of hyper-parameters, which are iteratively 
updated through marginal likelihood maximization; constraining the structure of the kernel acts as a (hyper)regularizer which 
helps controlling the effective degrees of freedom of our estimator. To optimize the marginal likelihood we adapt a Scaled 
Gradient Projection (SGP) algorithm which is proved to be significantly computationally cheaper than other first and second 
order off-the-shelf optimization methods. The paper also contains an extensive comparison with many state-of-the-art methods 
on several Monte-Carlo studies, which confirms the effectiveness of our procedure. 


1 Introduction 

Although linear system identification is sometimes con¬ 
sidered a mature field, with a wide and solid literature 
summarized in the well known textbooks [35,51], the re¬ 
cent developments on regularization based methods have 
brought new insights and opened new avenues. The most 
common “classical” approaches are parametric Predic¬ 
tion Error Methods (PEM) [35,51], where model classes 
(OE, ARMAX, Box-Jenkins, state-space, etc.) are de¬ 
scribed by a finite dimensional parameter vector which is 
estimated minimizing the squared prediction errors, and 
subspace methods, which translate ideas from stochas¬ 
tic realization theory [17,32] into algorithms which work 
on measured data [54]. 

These techniques require that a model complexity (the 
order hereon) is fixed, and thus estimated, first. As an 
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alternative to the standard parametric approach, recent 
literature has proposed a Bayesian perspective, leading 
to a class of regularized methods [43,42,10,44,8,58,59]. 
The use of Bayesian inference is not new in the field of 
identification and time-series estimation: early works on 
this topic appeared in the late’70, early ’80 [1,21,30,25]; 
see [14] for an overview. 

The Bayesian paradigm considers the impulse response 
as a stochastic process whose prior distribution penal¬ 
izes undesired systems (e.g. unstable ones). This allows 
to face the so-called bias/variance trade-off by jointly 
performing estimation and model selection. 


In [43,42,10] prior distributions are designed to encode 
smoothness and stability of the impulse response to be 
estimated, leading to 1 ' 2 -type penalties so that closed- 
form solution are available. These priors can also be 
shown to be solutions of Maximum Entropy problems, 
see [46,39,5,9], 

In this paper, we focus on the identification of multi 
input-multi output (MIMO) systems, where matrix im¬ 
pulse responses have to be identified. Similar problems 
are encountered in multi-task learning where one would 
like to simultaneously estimate multiple functions while 
also exploiting their mutual information. To this aim 
[6,2,36,22,45] have considered vector-valued kernels 
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which account for the smoothness of the functions to 
be estimated. In the identification of finite dimensional 
linear MIMO systems, the coupling between different 
input-output channels is captured by Hankel matrix, 
which has finite rank equal to the McMillan degree of 
the system. 

The Hankel matrix and its properties have already been 
thoroughly exploited in subspace methods, where also 
Vector AutoRegressive Models (VARX) estimated un¬ 
der the PEM framework play a fundamental role; in 
fact it has been shown in [12] (see also [11,13]) that cer¬ 
tain subspace methods can be seen as estimation of a 
long (i.e. “nonparametric” in the context of this paper) 
VARX model followed by a suitable (data based) model 
reduction step. This paper goes one step further, by 
merging these two steps in one. While subspace methods 
reduce the order of the estimated VARX model via a 
model reduction step, in this paper regularization takes 
care of both stability and “complexity” (in terms of 
McMillan degree) at once, while estimating the VARX 
model itself. 

Within this framework, our recent works [48,49,47] have 
attempted to merge the benefits of accounting for both 
stability/smoothness as well as complexity when build¬ 
ing prior models. The main contributions of this work, 
w.r.t. the above referenced papers are: (i) development, 
by means of MaxEnt arguments, of a new kernel en¬ 
coding both complexity as well as smoothness and sta¬ 
bility (the new kernel is parametrised differently w.r.t. 
previous conference publications and also the resulting 
algorithm is different); (ii) a new tailored Scaled Gra¬ 
dient Projection algorithm for marginal likelihood opti¬ 
mization (this had been used but not derived elsewhere) 
and (iii) an extensive simulation study comparing sev¬ 
eral state-of-the-art algorithms. 

We shall now provide a more detailed description of these 
contributions as well as a brief discussion of the relevant 
literature. 

The first main goal of this paper is to develop, by means 
of Maximum Entropy arguments, a vector-valued ker¬ 
nel which accounts both for the stability of the system 
to be estimated and for its complexity, as measured by 
its McMillan degree. The prior distribution introduced 
here leads, as a special case, to an Hankel nuclear norm 
penalty, an heuristic related to that proposed in [23] as 
a convex surrogate to the rank function. In the system 
identification literature the nuclear norm heuristic has 
also been applied in the context of subspace identifica¬ 
tion [27,55,34], even in presence of incomplete datasets 
[33], to control the order of the estimated model. PEM 
methods equipped with nuclear norm penalties on the 
Hankel matrix built with the Markov parameters have 
also been considered [29,26]. Refer to [49] for a brief sur¬ 
vey on the topic. 

However, direct use of nuclear norm (or atomic) penal¬ 
ties may lead to undesired behavior, as suggested and 
studied in [40], due to the fact that nuclear norm is not 


able alone to guarantee stability and smoothness of the 
estimated impulse responses. To address this limitation, 
[15] already suggested the combination of the stabil¬ 
ity/smoothness penalty with the nuclear norm one; dif¬ 
ferently from the prior presented in this paper, the for¬ 
mulation given in [15] did not allow to adopt marginal 
likelihood maximization to estimate the regularization 
parameters. 

Exploting the structure of the prior distribution used in 
this paper we design an iterative procedure which alter¬ 
natively updates the impulse response estimate and the 
hyper-parameters defining the prior. Our algorithm is 
related to iteratively reweighted methods used in com¬ 
pressed sensing and signal processing [4,7,20,38,24] and 
so-called Sparse Bayesian Learning (SBL) [57,53]. 

Our algorithm differs from the previous literature in 
that the regularization matrix takes on a very special 
structure, described by few hyper-parameters. With this 
special structure the weights update does not admit a 
closed-form solution and thus direct optimisation of the 
marginal likelihood needs to be performed. 

To this purpose, as a second main contribution, this 
paper develops a Scaled Gradient Projection method 
(SGP), inspired by the one introduced in [3], which is 
more efficient than off-the-shelf optimization procedures 
implemented in MATLAB. 

As a final contribution, the paper provides an extensive 
simulation study, where the proposed identification al¬ 
gorithm is compared with classical and state-of-the art 
identification methods, including PEM [35], N4SID [54], 
Stable Spline [42], reweighted nuclear norm-based algo¬ 
rithms [37] and regularized “subspace” methods [55]. 

While a clear-cut conclusion in terms of relative perfor¬ 
mance cannot be drawn at the moment, it is fair to say 
that: (a) the new method developed in this paper outper¬ 
forms the classical “Stable-Spline” [42], especially when 
dealing with MIMO systems; (b) the new method out¬ 
performs a reweighted Nuclear Norm algorithm in cer¬ 
tain scenarios (e.g. a “mildly-resonant” fourth order sys¬ 
tem) while performing comparably in others (e.g. ran¬ 
domly generated “large” MIMO systems). 

The paper is organized as follows. Section 2 introduces 
the problem and Section 3 briefly frames system iden¬ 
tification in the context of Bayesian estimation. In Sec¬ 
tion 4 Maximum Entropy arguments are used to derive 
a family of prior distributions. Section 5 illustrates our 
algorithm while Section 6 describes the adaptation of 
a Scaled Gradient Projection method, which is used to 
solve the marginal likelihood optimization problem. An 
extensive experimental study will be conducted in Sec¬ 
tion 7, while some concluding remarks will be drawn in 
Section 8. 

Notation 

In the following, R,R + := [0,oo),Z and N denote re¬ 
spectively the set of real, positive real, integers and nat- 
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ural numbers. R™ and R mxn will denote respectively 
the set of n-dimensional real vectors, and to x n real 
matrices. The transpose of A £ R mxn w iH be denoted 
A T . 0 n , 0 mxn and I n will denote respectively the zero 
vector in R n , the zero matrix in R mxrl and the n x n 
identity matrix. The symbol ® will denote the Kro- 
necker product, A f(p, cr) the Gaussian distribution with 
mean g and variance a. Given v £ R n , diag(v ) will 
be a diagonal matrix of size n x n with the diagonal 
given by v. Given matrices V) £ t miXni , i = l,..,n, 
blkdiag(Vi ,..., V n ) will denote the block-diagonal matrix 
of size (to i + ... + to„) x (rii + ... + n n ) with the V)’s as 
diagonal blocks. E[-] and Tr{-} will respectively denote 
expectation and trace. 

2 Problem Formulation 

We consider the following linear, causal and time- 
invariant (LTI) Output-Error (OE) system: 

V(t) = H(q)u(t ) + e(t) ( 1 ) 

where y{t) = [yi(t), ..,y p (t)] T € R p is the p-dimensional 
output signal, u(t) = [«i(f),.., u m (t)) T £ R m is the Tri¬ 
dimensional input signal, e(t) is additive noise and 

OO 

H(q)=J2 h ( k )q- k ( 2 ) 

k —1 

is the system transfer function with q being the back¬ 
ward shift operator: q~ 1 u(t) = u{t — 1). For simplicity, 
we will assume the presence of a delay in H(q), i.e. h(0) = 
H(o o) = 0. In addition, we assume e(t) ~ A/’(0 p ,E), 
E = diag(cr), a = [<ti, ..., a p ] T . 

The objective is to estimate, from a finite set of input- 
output data Pat = {u(t),y(t); t = 1, N}, the impulse 
response coefficients {h{k) £ R pxm ; k = 1, ...,oo}. 

In the remaining of the paper, we shall consider 
{(/(<); t £ Z} and {u(t); t £ Z} as jointly stationary 
zero-mean stochastic processes; furthermore, the in¬ 
put signal is assumed to be independent of the noise 
{e(i); t £ Z}. The results of this paper can be easily 
extended to VARMAX/BJ type model structure, for¬ 
mulating the identification problem as estimation of the 
predictor model as done in [42]. 

3 Bayesian/regularization approach 

In line with the recent developments in linear system 
identification, we tackle the problem outlined in Section 
2 by adopting a Bayesian approach. Namely, we consider 
h as the realization of a stochastic process, embedding h 
in an infinite-dimensional space. For simplicity, consider 
the Single-Input-Single-Output (SISO) case. A typical 


choice is to model h as a zero-mean Gaussian process 
with covariance function K v : R + x R + —> R, 

E[/i(t)/i(s)] = K v (t,s) (3) 

where K n is parametrized via the hyper-parameter vec¬ 
tor rj £ Q C R d .The covariance function K v (t,s), also 
called “kernel” in the machine learning literature, is ap¬ 
propriately designed in order to account for the desired 
properties of the impulse response to be estimated (e.g. 
stability, smoothness, etc.; see [42,10,44]). 

In this Bayesian framework the minimum variance es¬ 
timate of h conditional on the observations {y(t); t = 
l,...,iV}, on the hyper-parameters g and on the noise 
covariance E is the conditional mean: 

h:=E[h\Y,g,a\ (4) 

where Y £ is the vector of output observations: 

Y:=[yi( 1) ••• 2 d (AO | ••• \y p (l) ••• y p (N)} T (5) 

Assuming also that the noise e is Gaussian and indepen¬ 
dent of h, Y and h will be jointly normal, so that for 
fixed 77 and a, h conditioned on Y is Gaussian. The esti¬ 
mator (4) is then available in closed form; in particular, 
when 77 and a are replaced with estimators 77 and d, (4) is 
referred to as the Empirical Bayes estimate of h [50]. Es¬ 
timates of 77 and (J can be found e.g. by cross-validation 
or marginal likelihood maximization, i.e. 

( 77 ,d):=arg max p(Y\g, a) ( 6 ) 

r)£il,cr>0 

where p(Y\g,a) denotes the likelihood of the observa¬ 
tions Y once the unknown h has been integrated out, 
commonly called the marginal likelihood. Under the 
Gaussian assumptions on h and on the noise e also the 
marginal likelihood p(Y\rj, a) is a Gaussian distribution. 

According to the Bayesian inference procedure outlined 
above, the impulse response h to be estimated lies in 
an infinite-dimensional space. However, thanks to the 
(exponentially) decaying profile of a stable impulse re¬ 
sponse, it is possible to estimate only a truncated ver¬ 
sion of h, i.e. to approximate H(q) with the transfer 
function of a long Finite Impulse Response (FIR) model 
Hr{q) = Sfc=i h(k)q~ k -, in this way one avoids dealing 
with infinite-dimensional objects. It should be stressed 
that the choice of the length T does not correspond to a 
complexity selection step, since T is simply taken large 
enough to capture the relevant dynamics of the unknown 
system. Henceforth, we will denote with h £ R Tmp the 
vector containing all the impulse response coefficients of 
{ h{k)\ k = 1 , ...,T}, appropriately stacked: 

h = [hJi ■■■ hj m ■■■ hJ-L • • • hJ m ] T (7) 

h '77 [A'i( 1 ) hij(2) * • • hij(T )] i — 1, ..,p, j — 1, .., 777 . 
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hij(k) represents the fc-th impulse response coefficient 
from input j to output i. Under the Bayesian frame¬ 
work, h is a Gaussian random vector h ~ 7V(0 Tmp, Kp ), 
K v £ ^TmpxTmp _ Exploiting the notation introduced in 
(7) and using the FIR approximation, the convolution 
equation (1) can be reformulated as a linear model: 

Y = $h + E, <S> G M NpxTmp, E g R Np (g) 

where the vector E collects the noise samples, while the 
d> = blkdiag(<fi, ...,</>) with (f> £ ffi .NxTm defined as: 


V?i(l) <p 2 (l) ••• <Pm(X) 
V?i(2) v? 2 (2) ••• (fim{ 2) 


l 


_<Pi(N) tp 2 (N) ••• ip m (N) _ 

m{j - 1) Ui(j - 2) • • • m{j - T) 

1 ,-,m, j = 1, N (9) 


Remark 3.1 The Bayesian inference scheme here illus¬ 
trated has a well-known connection with the theory of Re¬ 
producing Kernel Hilbert Spaces (RKHS). Indeed, once 
a Gaussian prior for the impulse response h is postu¬ 
lated with the covariance function defined in (3), the op¬ 
timal estimate h can also be derived as the solution of a 
Tikhonov regularization problem and will be an element 
of the RKHS H k associated to the kernel K n . If the true 
impulse response h belongs to H^ , then the so-called 
“model bias”, accounting for the error between the true 
h and its closest approximation in the hypothesis space, 
disappears ([28], Sec. 7.3). In particular, the RKHS asso¬ 
ciated to the so called stable spline kernel (adopted in the 
sequel) is very rich. For instance, the impulse response 
of any BIBO stable finite dimensional linear system be¬ 
longs to Hjf for a suitable choice of rj. In practice, g is 
estimated by (6): this permits to tune model complexity, 
trading bias and varianc ^| in a continuous manner. 


4 Derivation of stable Hankel-type priors 


Since h and E are modelled as Gaussian and indepen¬ 
dent, Y and h are jointly Gaussian and h conditionally 
on Y is Gaussian, so that (4) takes the form: 


h := E[h|Y, g, a] = 




k: 


$ T £ -1 y 


( 10 ) 

with £ := £ (g) Jjv. By recalling a known equivalence be¬ 
tween Bayesian inference and regularization, the previ¬ 
ous estimate can also be interpreted as the solution of the 
following Tikhonov-type regularization problem [56]: 


h = arg min (Y - 4>h) T E- 1 2 (F - $h) + J„(h) (11) 

heR Tm p 


with J r/ (h) = h T A'“ 1 h. 

The previous expression shows that the choice of the ker¬ 
nel K v plays a crucial role for the success of the Bayesian 
inference procedure. Indeed, it shapes a regularization 
term ^(h) which penalizes impulse response coefficients 
corresponding to “unlikely” or “undesired” systems. In 
Section 4 we will develop a new class of kernels which 
induces a penalty of the type: 


In recent contributions the standard smoothing spline 
kernels [56] have been adapted in order to represent co- 
variances of exponentially decaying functions ([43], [42]). 
For instance, considering SISO systems, the 1st order 
stable spline kernel (see [43] and [10] where it has been 
named Tuned-Correlated (TC) kernel) is defined as 

[K s ,u\ kl =c mm{f3 k ,p 1 } (13) 

where v = [c, /3], c > 0, 0 < /3 < 1 play the role of 
hyper-parameters. For a suitable choice of /3, the im¬ 
pulse response of any BIBO linear system belongs a.s. 
to the RKHS associated to the kernel in (13), see [43]; 
thus, by adopting this kernel the “model bias” is zero. 
Recently, [9] has shown that the kernel function from 
which (13) derives admits a Maximum Entropy inter¬ 
pretation. More specifically it is the covariance function 
of a zero-mean Gaussian process defined over N+, which 
is the solution to the Maximum Entropy problem with 
constraints (k = 1,..., k £ N+) 

Var [h(k + 1) - h(k)} =c(/3 k - /3 k+1 ) (14) 


JshA h) = h T A^h + h t A-> (12) 

The first term in Jsh,t]( h) will account for the smooth¬ 
ness and stability of the impulse response to be es¬ 
timated, while the second one will penalize high- 
complexity models. Estimation of the hyper-parameters 
g and computation of the impulse response estimate 
h through an iterative algorithm will be discussed in 
Section 5. 


Exploiting a well-known result on Maximum Entropy 
distributions, see e.g. [19, p. 409], the zero mean Gaus¬ 
sian prior with covariance (13) can also be derived by 
imposing the constraint]^] 


E 


^ T K~s> 


= c 


(15) 


1 For the reason discussed above only “estimation-bias” will 
be present. 

2 Note that constraint (15) contains (14). 
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When dealing with MIMO systems one needs to consider 
a block-kernel, with the j*-th block Kg l J £ R TxT (the 
cross-covariance of the impulse response from the 7-th 
input and the j-th output) defined e.g. as in (13). In 
the recent literature, see e.g. [18], the cross terms (i.e. 
Kg l J , 7 7 ^ j) have been set to zero. As we shall argue in 
a moment, this assumption is often unreasonable. 

In fact, while smoothness is considered as a synony¬ 
mous of “simplicity” in the machine learning literature, 
a system theoretic way to measure complexity is via the 
McMillan degree of h, i.e. the order n of a minimal state 
space realization 


x (t + 1) = Ax(t) + Bu(t) 
y(t ) = Cx{t) + eft) 


x(t) £ 


(16) 


The impulse response coefficients of (16) are given by 
h(k) = CA k ~ 1 B £ R pxm , a relation which couples the 
impulse responses hij ( k ) as 7 and j vary. This calls for 
prior distributions on h which encode this coupling. To 
this end, we first introduce the block Hankel matrix 
77(h) £ 


n( h) = 


rxmc given by: 



Mi) 

M2) 

M3) • 

Me) 

M2) 

M3) 

M4) • 

M c + i) 

_h(r) 

h(r + 1 ) 


• h(r + c — 1 ) 


(17) 


A classical result from realization theory (see [52] for 
details) states that the rank of the block Hankel matrix 
77(h) equals the McMillan degree of the system, if r and 
c are large enough. In this work r and c are chosen so 
that r + c — 1 = T and the matrix 77(h) is as close as 
possible to a square matrix. 

From now on, to the purpose of normalization, we shall 
consider a weighted version 77 ( h) of 77(h): 


where pk is the k — th canonical correlation coefficient 
and n is the McMillan degree of a minimal spectral fac¬ 
tor of y. 

This provides a clear interpretation of canonical correla¬ 
tions as well as of the impact of shrinking them in terms 
of mutual information. A similar interpretation holds for 
systems with inputs, relating conditional mutual infor¬ 
mation and conditional canonical correlations, i.e. sin¬ 
gular values of (18) with the proper choice ofW\ andW 2 - 


f.l Maximum Entropy Hankel priors 


We shall now introduce a probability distribution p(h) 
for h, such that samples drawn from p(h) have low rank 
(or close to low rank) Hankel matrices. To this purpose, 
we would like to favour some of the singular values of 
77 ( h) to be (close to) zero: this can be achieved impos¬ 
ing constraints on the eigenvalues of the weighted ma¬ 
trix 77(h)77(h) T . Let Mj(h) be the 7-th singular vector 
of 77(h)77(h) T . To achieve our goal we shall constrain 
the (expected value) of the corresponding singular value 
sf(h), i.e. 


E[s?(h)]=E u t (h) T H(h)H(h) T Ul (h) 


< W; 


( 20 ) 


fori = 1, ...,pr. Here the expectation is taken w.r.t.p(h), 
while the u>i ’s play the role of hyper-parameters that will 
have to be estimated from the dataH 

In order to design p(h), we first assume that an esti¬ 
mate h of h is available. We shall see in Section 5 how 
this “preliminary” estimate of h arises as an intermedi¬ 
ate step in an alternating minimization algorithm. 
Thus, we consider the (weighted) estimated Hankel ma¬ 
trix 77(h) and its singular value decomposition 

USU T := 77(h)77(h) T (21) 

We can now reformulate the constraints (20) as 


77(h) := Wj 77(h) W? (18) 

where W\ and W 2 are chosen, see [15], so that the sin¬ 
gular values of 77(h) are conditional canonical correla¬ 
tion coefficients between future outputs and near past 
inputs, given the future inputs and remote past inputs. 

Remark 4.1 For Gaussian processes, there is a one-to- 
one correspondence between the Canonical Correlation 
Analysis (CCA) and mutual information. Indeed, the 
mutual information between past (y~) and future (y + ) 
of a Gaussian process {y(t)}tez is given by: 


E 


7t( r 77(h)77(h) T 7i i 


< W 7 ; 


i=l,...,pr (22) 


where Ui denotes the 7-th column of U. In this way we 
have fixed the vectors Ui, so that only 77(h) is random 
in (22). Fixing the uf s, which in general are not the 
(exact) singular vectors of the “true” Hankel matrix, 
introduces a perturbation on the constraint (and thus on 
the resulting prior distribution). One way to make the 
constrains ( 22 ) robust to such perturbations is to group 
estimated singular vectors into the so-called “signal” and 


1 

Kv + ; y~) = ~2 5Z log ( 1- ^) 

k =1 


(19) 


In fact, one shall not estimate directly the wfs, but rather 
the corresponding dual variables appearing in the MaxEnt 
distribution, i.e. the Ads in (26). 
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“noise” subspaceQ To this purpose let us group the 
first n singular vectors and partition U and S as follows: 

U=[u n u£] s = blkdiag(S n , S^) (23) 


where U n £ R prxn . Note that, while the uf s correspond¬ 
ing to small singular values are likely to be very noisy, 
both the “signal” space spanned by the columns of U n , 
as well as that spanned by Ui, i = n + 1, ..,pr (i.e. the 
column space of U^) are much less prone to noise; this 
is easily derived from a perturbation analysis of the sin¬ 
gular value decomposition which shows that the error 
in depends on the gap between the smallest singu¬ 
lar value of S n and the largest one of S);. In view of 
these considerations, we can relax the constraints (22) 
by aggregating the “signal” components (i.e. the first n 
singular vectors): 


E 


Tr{p„ T P(h)P(h) T [/„}] <£?=1 


(24) 


where well known properties of the trace operator have 
been used. Similarly, we group the constraints on the 
“noise” component (i.e. the last pr — n singular vectors) 


E 


Tr 


{u^Y n(h)n(h) T u^ 


pr 

< H 

i=n -\-1 


(25) 


Exploiting a well known result [19, p. 409], we can build 
the Maximum Entropy distribution subject to the con¬ 
straints (24) and (25): 


p c (h) <xexp(-A 1 Ti-{(7jH(h)W(h) T f7„}) • 

exp UA 2 Tv|(^) T H(h)7i(h) T t/ n ± jj 
oc exp (-Tr {u T H(h)H(h) T U blkdiag{XJ n , A 2 / pl ._ n )}) 
<xexp(-Tr{p(h)P(h) T Q(C)}) (26) 

where ( := [Ai, A 2 ,n], Ai >0, A 2 > 0, and 

Q(C) :=U blkdiag(\il n , X 2 I pr -n ) U r (27) 


Remark 4.2 We would like to stress that the quality of 
the relaxation introduced in constraints (24) and (25) 
depends on the relative magnitude of the Hankel singu¬ 
lar values. Using the “normalized” Hankel matrix (18) 


4 In fact, in this way perturbations “within” the signal and 
noise subspaces respectively have no effect. 


plays an important role here since its singular values, be¬ 
ing canonical correlations, are in the interval (0,1]. On 
the other hand, the aggregation of the singular values 
along the “noise” subspace resembles the role played by 
the regularization factor in Iterative Reweighted meth¬ 
ods [7,57]. We refer to Appendix B for a thorough discus¬ 
sion on the connection between these methods and our 
approach. 


Remark 4.3 Notice thatQ (() in (27) is the sum of two 
orthogonal projections Q(() = XiU n U[] + X 2 U„ (u^ J , 
respectively on what we called the “signal subspace” (that 
would coincide with the column space of Hi h) ifn was the 
true system order) and on the “noise subspace”. This ob¬ 
servation provides new insights on the design of the prior 
in (26): namely, by properly tuning the hyper-parameters 
0 the prior is intended to be stronger along certain di¬ 
rections of the column space of 77(h) (referred to as the 
“noisy” ones) and milder along what we call the “signal” 
directions. 


Since TL( h) is linear in h, Tr j'W(h)'H(h) T Q(£) j is 

quadratic in h and letting <9(0 = LL T , it can be 
rewritten as: 

Tr {-H(h)-H(h) T Q(0} = Tr {l t H( h)H(h) T l} (28) 

= ||vec('H(h) T L )||2 

= \\(L T Wj ® W 1 )vec(77(h) T )||2 

= h T P T (W 2 Q{()Wj 0 W? Wi)Ph (29) 

where P £ j^rpcmxTmp that vec (77(h) T ) = Ph. 

Inserting (29) in (26) we obtain 

p c {h) oc exp (-h T P T (W 2 <9(C)W 2 T 0 Wj Wi)Ph) 

(30) 

i.e. for given 0 h is a zero-mean Gaussian vector: 

h~AA(0 Tmp,K HX ) (31) 


with 


K h , c = 


P T (W 2 Q(QW. 2 T 0 WjW X )P 


C — [Ai, A 2 , n] 


(32) 

(33) 


Using (31) as a prior distribution for h, we can recast 
the problem of estimating h under the framework out¬ 
lined in Section 3. In particular, complexity (in terms 
of McMillan degree) is controlled by properly choosing 
the hyper-parameters Q. which can be done by marginal 
likelihood maximization as further discussed in Section 
5. 


6 









Remark 4.4 From the regularization point of view (11), 
the penalty induced by the kernel (32) can also be derived 
through a variational bound. We refer the reader to [48] 
for more details about this derivation. 

We shall notice that, when £* = [A*, A*, 0], quantity (28) 
(from which kernel (32) arises) reduces to 

w{i?(h)£(h) T Q(C*)} =Tr{n(h)H(h) T \*I rp } 

= A*£s?(h) (34) 

i 

where s^(h) are the singular values of H(h). Thus, the 
nuclear norm penalty on the (squared) Hankel matrix 
can be derived from kernel (32) as a special case, i.e. for 
a special choice of the hyper-parameters. The use of nu¬ 
clear norm regularization is not new in system identifi¬ 
cation: a comparison with the literature can be found in 
Appendix A. 

f.2 Maximum Entropy stable-Hankel priors 

As thoroughly discussed in [40], the kernel arising from 
the “Hankel” constraint alone would not necessarily lead 
to stable models. In fact given an unstable system and 
its finite Hankel matrix H, it is always possible to de¬ 
sign a stable system whose finite Hankel matrix (of the 
same size as H) has the same singular values of H. In 
addition, the Hankel prior does not include information 
on the correlation among the impulse response coeffi¬ 
cients (see [40]). Thus, as a final step, we shall consider 
the Maximum Entropy distribution [19, p. 409], under 
both stability (15) and low complexity ((24) and (25)) 
constraints, thus obtaining 

Ptj{ h) oc exp (-A 0 h T A<7],h - h K~^ ( h) 

oc exp (-h T (AoAAtJ, + Kfj 1 ^ h) (35) 

where p = [v, Aq, £], Ao > 0, and Kh,c is the kernel 
in (32). The use of a further hyper-parameter, Ao, will 
become clear later on. From the distribution (35) we can 
derive the kernel 

K S h, v = (a 0 ~ l (36) 

= Ao Kgl + P T (W 2 Q{C)Wj <g> W? Wi)pJ _1 

with hyper-parameters 

V = W, A 0 , C] (37) 

and £ as defined in (33). 


5 Identification Algorithm 

This section describes the iterative algorithm to estimate 
the impulse response h when the prior is chosen as in 
(35). The algorithm alternates between the estimation 
of h (see ( 10 )) for fixed hyper-parameters and marginal 
likelihood optimization (see ( 6 )). 

The procedure is summarized in Algorithm 1. For ease 
of notation we have defined A := [Ao, Ai, A 2 ]• Hence, the 
hyper-parameters vector 77 in (37) can be rewritten as 

V = [”> A 0 ,C] = [v, A 0 ,Ai, A 2 ,n] = [v,\,n\ 

Furthermore, •W-* and h ^ denote estimators 

at the k -th iteration of the algorithm. 


Remark 5.1 In Algorithm 1 the noise variance a is 
fixed e.g. to the sample variance of an estimated ARX or 
FIR model. Of course a could also be treated as a hyper¬ 
parameter, and estimated with the same procedure based 
on the marginal likelihood. 

Remark 5.2 Algorithm 1 has strong connections with 
iterative reweighted algorithms, see Appendix B for de¬ 
tails. 

Algorithm 1 Identification Algorithm 
1: Set the resolution e > 0 
2: Estimate <7 as illustrated in Remark 5.1. 

3: n[°'> <- 0_ 

4: Ufai 0 ) = Uq i 0 r pxrp 
5: tA(o) Irp 

6: v <- argmax„ € n p(Y\u, [1,0,0], h (0) ,<j) 

7: A^ 0 ) «— argmax AeR 3 p(Y |z>, A, h^°\ a) 

8: k «- 0 

9: while < pr do 

10: h^ ^ E[h|y,77( fc ),dHusingJ10)) 

11: Compute the SVD: = USU T 

12 : h( fc+1 ) <r- ii^l ^ ^ 

13: Determine U^k+x) and t/A (fc+1) from U. 

14: A^ fc+1 ) <— argmax AeK 3 p(Y\i>, \,nf k+1 \a) 

15: if p(V|i>, A(' s + 1 >,n(' s + 1 ),S') > (l+e)p(y|i>, x(*),n( fc+1 ),S) 

then 

16: k <— k + 1 

17: else 

18: h (k+1) <- h (k) + 1 

19: Perform steps 13 to 14. 

20: if p(Y\i>, W+^.ftO+O.a-) > (1 + e) P (y|i>, Y k \ h ( ~ k + 1 \a) 

then 

21 : k^k + l 

22: else 

23: break 

24: end if 

25: end if 

26: end while 
27: Return h <- h (fc) 
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Notice that the marginal likelihood maximization per¬ 
formed in steps 7 and 14 of Algorithm 1 boils down to 
the following optimization problem: 

A^ = arg min Y T A(i>, A, h^ k \ cr) _1 y+ 

AeR^_ 

log |A(j>, A, n (fe \ <t)| (38) 

where _ 

A(r?,a) :=E + ^K S h, v ^ T (39) 

Section 6 will illustrate a Scaled Gradient Projection 
(SGP) method appropriately designed to solve (38). We 
shall now discuss issues related to initialisation and con¬ 
vergence of Algorithm 1. 


5.1 Algorithm Initialization 


In the derivation of kernel Ksh,ti in Section 4 it has 
been assumed that a preliminary estimate h was avail¬ 
able. Therefore the iterative algorithm we outline in this 
section has to be provided with an initial estimate 
Exploiting the structure of the kernel Ksh.^ in (36), two 
straightforward choices are possible: 


(1) Initialize using only the stable-spline kernel (as the 
one in (13)), i.e.: 


h(°) = ^ 




s,e(°) 




r)(°) = 


i> (o) ,A (o) ,0 , A (0) = [1,0,0] (40) 


where only the hyper-parameters are estimated 
through marginal-likelihood maximization (6). 

(2) Initialize using the stable-Hankel kernel with h = 0, 
so that no preliminary estimate is needed to initial¬ 
ize U n (which is empty) and thus Q(C^) = A ^1: 


hl» ) = (* T E- , 4> + A--;, ( „y 


r)(°> = 


j>C°),A<°),0 , A<°> = 


$ t £ -1 F 
i x(°) \(°) 


(41) 


where 5^ and A^ are estimated through marginal 
likelihood maximization (6). 

The procedure we actually follow (illustrated in Algo¬ 
rithm 1) combines the two strategies above. Namely, the 
first approach is adopted to fix the hyper-parameters 
v defining the stable-spline kernel (line 6). These are 
then kept fixed for the whole procedure. We then follow 
the second strategy to estimate A^ 0 ^ (line 7). Note that 
in line 7 the hyper-parameters v are fixed to v and not 

estimated as in (41). Analogously, Ag°^ is estimated by 
marginal-likelihood maximization and not set a-priori 
to 1 as in (41). Therefore, the estimate h^ 0 ) computed 


at line 10 is derived by adopting the kernel K SH fj (o) 
with fj = \v, A(°), 0]. 

This sort of “hybrid” strategy has been chosen for two 
main reasons. First, it allows to fix the hyper-parameters 
v by solving a simplified optimization problem (w.r.t. 
solving a problem involving all the hyper-parameters 
??). Notice that this also provides the user with a certain 
freedom on the choice of the kernel Ksy. using other 
kernel structures (see e.g. [16]) additional properties 
(e.g. resonances, high-frequency components, etc.) of 
the impulse response can be accounted for. Second , it 
also allows to properly initialize the iterative procedure 
used to update the hyper-parameters Ao and £ in (37), 
until a stopping condition is met (see next section for a 
discussion about convergence of Algorithm 1). 


5.2 Convergence Analysis 

Algorithm 1 is guaranteed to stop in a finite number 
of steps, returning a final estimate h. Indeed, at any 
iteration k four possible scenarios may arise: 

(1) Condition at line 15 is met and k is increased by 
one and the algorithm iterates. 

(2) Condition at line 15 is not melTH so that h is in¬ 
creased by one, and condition 20 is not met, then 
the algorithm terminates returning h := lr fc ). 

(3) Condition at line 15 is not met 4 , so that h is in¬ 
creased by one, while condition 20 is met, then k is 
increased by one and the algorithm iterates. 

(4) n = pr , then the algorithm terminates returning 
h := h( fe ). 

Conditions (1) and (3) may only be satisfied a finite 
number of times, thus the algorithm terminates in a fi¬ 
nite number of steps. 

We also stress that Algorithm 1 is only an ascent algo¬ 
rithm w.r.t. the marginal likelihood without any guar¬ 
antee of convergence to a local extrema. If U fl <k) was 
treated as a hyper-parameter and the marginal likeli¬ 
hood optimised over the Grassmann manifold, then con¬ 
vergence to a local maxima could be proven]^] Notice 
indeed that we adopt a tailored Scaled Gradient Projec¬ 
tion algorithm to solve the marginal likelihood optimiza¬ 
tion problem at line 14 (see Section 6): every accumu¬ 
lation point of the iterates generated by this algorithm 
is guaranteed to be a stationary point ([3], Theorem 1); 
furthermore, for the specific problem we are solving, the 
sequence of the iterates admits at least one limit point. 


5 This certainly happens after a finite number of iterations 
for any positive resolution e and fixed h. 

6 We have tested this variant, which is considerably more 
computationally expensive than Algorithm 1. Since no sig¬ 
nificant improvements have been observed, we only present 
the simpler version in this paper. 









Once the algorithm has converged, h is the optimal di¬ 
mension of the “signal” and “noise” subspaces of H( h), 
respectively spanned by the columns of Ua and C/A. Fur¬ 
thermore, the corresponding multipliers Ai and A 2 in £ 
are expected to tend, respectively, to 0 (meaning that 
no penalty is given on the signal component) and to 00 
(that is, a very large penalty is assigned to the noise sub¬ 
space); if A 2 = 00 , h would actually be the McMillan 
degree of the estimated system. 

In practice the estimated hyper-parameter A 2 is finite 
and, similarly, Ai is strictly positive. As a result the 
McMillan degree of the estimated system is generically 
larger than, but possibly close to, h. Therefore, esti¬ 
mation of the integer parameter n should not be inter¬ 
preted as a hard decision on the complexity as instead 
happens for parametric model classes whose structure is 
estimated with AIC/BIC/Cross Validation. Therefore, 
we may say that Algorithm 1 performs a “soft” com¬ 
plexity selection, confirming that this “Bayesian” frame¬ 
work allows to describe model structures in a continuous 
manner; in fact, for any choice of n, systems of different 
McMillan degrees are assigned non zero probability by 
the prior. 


6 SGP for marginal likelihood optimization 

A crucial step in Algorithm 1 is the marginal likelihood 
maximization (step 14) which is computationally expen¬ 
sive, especially when the number of inputs and outputs 
is large. To deal with this issue we have adapted the 
Scaled Gradient Projection method (SGP), proposed in 
[3], to solve 

min/(A) (42) 

/(A) = V t A(i>, A, h, d-) -1 y + log |A(z>, A, h, a)\ (43) 

with Cl = The SGP is a first order method, in which 
the negative gradient direction is doubly scaled through 
a variable step size ak and a positive definite scaling ma¬ 
trix Dk, which are iteratively updated. A careful choice 
of these scalings, illustrated later on, allows to speed up 
the, theoretically linear, convergence; the reader is re¬ 
ferred to [3] for details. The main steps of the algorithm 
are as follows (see Algorithm 2 for details): 

(1) Set the descent direction (scaled negative gradient) 

__ 

AA :=-<**£>* V /(\ {k) ) (44) 

(2) Project the candidate update A = X (k> + AA on 
the constraint set 

n nn- 1 ^) = argmin(x - A ) T D^ l {x - A) (45) 

K i i 


and define the final descent direction: 

AA< fc >:=II n i(A)-A« (46) 

’ k 

Since Cl is the positive cone the projection is merely 
a truncation to non-negative values of A (and is 
independent of the scaling Dk). 

(3) Update A along the direction AA^ as follows: 

A (fc+i) := A (fc) + 4 AA (fc) (47) 

with the steplength 4 computed through an Armijo 
backtracking loop. 


Algorithm 2 Scaled Gradient Projection (SGP) 

method _ 

1: Choose the starting point A^ G Cl. 

2 : Set the parameters v, 7 G (0,1), 0 < a m i n < a max , 
0 < L m i n < L max and the positive integer M. 

3: for k = 0,1,2... do 

4: Choose Ofc G [a m i n , a max ] and the diagonal 

scaling matrix Dk such that L m i n < [Dk\ u < 
C'Tnaxi i I 5 2) 3. 

5: Set the candidate direction as in (44) 

6 : Compute the projection II n D -i (A) using (45) 

’ fc 

7: Compute the descent direction as in (46) 

8 : Set 5k = 1. 

9: if 

/(A w +<5 fc AA (fc) ) < /(A (fe) )+v4V/(A (fc) ) T AA (fe) 

then 

10: Go to step 14. 

11: else 

12: Set 5k = 7 5k and go to step 9. 

13: end if 

14: Set A (fe+1) = A (fe) + 6k AA (fc) . 

15: end for 


In step 4 of Algorithm 2, the stepsize a.k is chosen by 
means of an alternation strategy based on the Barzilai- 
Borwein rules (as in [3]), which aim at finding otk so that 
OfcDfc approximates the inverse Hessian matrix. 

The choice of the scaling matrix Dk strictly depends on 
both the objective function and the constraints of the op¬ 
timization problem. In our implementation we followed 
the choices made in [3]: Dk is set to be diagonal and its 
update is based on the split gradient idea. Let us first 
define 

/o(A) = V t A(A)- 1 Y, h( A) = log |A(A)| (48) 

where we have used the simplified notation A(A) = 
A(D, A, h, a). Moreover, we define 

Ksh,\ '■= [AqPo + AiTi + A 2 r2] 1 (49) 
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where v and n are fixed and 


r 0 = R-^ (50) 

r, = p t (w 2 UnUlwj 0 wj p (5i) 

r 2 = P T (w 2 U± (pit') T wj 0 w?w^j P (52) 

Now, indicating with Vj/( A) the gradient of / w.r.t. to 
A i, i = 0,1, 2, we have: 

V j/o(A) = y T A(A)- 1 $A^,Ar i A' S ff,A$ T A(A)- 1 y 

(53) 

VJi(A) = —TV [$ T A(A)- 1 $J? s ^Ar i lTsff,A] (54) 

From the positive definiteness of A(A) and the positive 
semidefiniteness of $KsH t \TiKsH,\& T , it follows that 
V,/o(A) > 0, VA £ R 3 . Furthermore, from Lemma II.1 
in [31], it follows that V;/i(A) < 0, VA £ M 3 . This shows 
how the gradient of the objective function (43) admits 
the following decomposition: 

V/(A) = V/ 0 (A) + VA(A) = B( A) - V(X) (55) 

with B{ A) = V/ 0 (A) > 0 and U(A) = -V/ X (A) > 0 
(here the inequalities have to be understood component 
wise). Using the gradient splitting (55), the Karush- 
Kuhn-Tucker optimality conditions for problem (42) 

AiVi/(A) = 0, Ai > 0 


can be written as the solution of a fixed point iteration 
(see eq. (4.8) in [3]) which leads to the scaling matrix D^: 


[D k \ u = min max L n 


Af } \ 

D(AW) ) 


(56) 


This choice of the scaling matrix has proven to be par¬ 
ticularly effective on ill-posed or ill-conditioned inverse 
problems, when it is combined with an appropriate 
choice of the stepsize a*,. 

Further details on the setting of the parameters involved 
in Algorithm 2 and on the adopted stopping criterion 
will be given in Section 7.6. 


The innovation process e[t) is always a zero-mean Gaus¬ 
sian white noise with standard deviation randomly cho¬ 
sen in order to guarantee that the SNR on each out¬ 
put channel is a uniform random variable in the inter¬ 
val [1,4]. For each scenario we test the identification 
procedures on three different data lengths, which can 
be roughly classified as “few/average/many” data. Each 
Monte-Carlo study includes Nmc = 200 runs. A brief 
illustration of the three scenarios follows. 


SI) We consider a fixed fourth order system with trans¬ 
fer function H(z) = C{zl — A)~ X B where 


A = blockdiag ^ 
B = [1 0 2 0] T 


0.8 0.5 
-0.5 0.8 


C = 


1 

0 

20 


0.2 0.9 
’ [-0.9 0.2 
111 " 
0.1 0 0.1 
0 2.5 0 


(57) 


The input is generated, for each Monte Carlo run, as a 
low pass filtered white Gaussian noise with normalized 
band [ 0 , g] where g is a uniform random variable in 
the interval [0.8,1]. The identification of system (57) 
using data generated by a band-limited input appears 
particularly challenging because the system is charac¬ 
terized by two high-frequency resonances. 

The three different data lengths that have been con¬ 
sidered are: Ai = 200, Aq 2 = 500, A i3 = 1000. 

52) For each Monte Carlo run H(z) is randomly gen¬ 
erated using the MATLAB function drmodel with 5 
outputs and 5 inputs while guaranteeing that all the 
poles of H{z) are inside the disc of radius .85 of the 
complex plane. The system orders are randomly cho¬ 
sen from 1 to 10. The input u(t) is zero-mean unit vari¬ 
ance white Gaussian noise. The three different num¬ 
bers of input-output data pairs that have been tested 
are: ./V 2j i = 350, A ^ 2 2 = 500, N 2 ^ = 1000 . 

53) The systems have been randomly generated simi¬ 
larly to scenario S2, but with 10 inputs and 5 outputs. 
Moreover, the input uit) is a low-pass filtered Gaus¬ 
sian white noise with normalized band defined as in SI. 
The considered data lengths are: N 3 i = 600, N 3 2 = 
800, A/ 3,3 = 1000 . 

7.2 Compared identification algorithms 


7 Simulation Results 

7.1 Monte-Carlo Simulations 

The identification procedure outlined in Algorithm 1 is 
now compared with off-the-shelf identification routines, 
as well as with recently proposed methods. The compar¬ 
ison is performed through some Monte-Carlo studies on 
three appropriately designed scenarios. 


The following algorithms have been testedfH 

(1) N4SID+Or: The subspace method, as implemented 
by the MATLAB routine n4sid. Different model 
complexities are tested; an Oracle chooses the order 
which maximises the impulse response fit (61). 


7 Some methods appeal to an Oracle (Or) who knows the 
true system. Clearly these are not feasible in practice and 
are only reported for the sake of comparison. 
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(2) N4SID(0E)+0r: As N4SID+0r but forcing the 
routine to return an Output-Error model. 

(3) N4SID: The MATLAB routine n4sid, equipped 
with default model order selection. 

(4) N4SID(OE): Same as N4SID by forcing an OE 
structure. 

(5) PEM+Or: PEM as implemented by the MAT- 
LAB routine pem. Different model complexities are 
tested: an Oracle chooses the order which max¬ 
imises the impulse response fit (61). 

(6) PEM(OE)+Or: Same as PEM+Or but using the 
routine oe. For each of the tested complexities, the 
routine oe has been initialized with the model re¬ 
turned by pem. 

(7) PEM: The MATLAB routine pem, equipped with 
the default model order selection. 

(8) PEM(OE): The MATLAB routine oe, initialized 
with the model returned by pem (order as selected 
by the default choice in pem). 

(9) N2SID: The identification routine proposed in [55] 
and implemented through the code available from 
http://users.isy.liu.se/en/rt/hansson/. 
This routine returns a state-space model in innova¬ 
tion form. The estimation of Output-Error models 
through N2SID has not been tested, since the rou¬ 
tine does not straightforwardly allow to force an 
OE model structure. 

(10) SS: The estimator (10) where K v is chosen to be 
the kernel TC introduced in [10]. The estimator is 
computed through the MATLAB routine arxRegul 
(imposing a FIR model structure). 

(11) NN+C V: A FIR model of order T estimated solving 

h = arg min ||F — ^>h||| + A*||77(h)||* (58) 

hgRmpT 

The optimization problem is solved through a tai¬ 
lored ADMM algorithm (as in [33]), while A* is de¬ 
termined through Cross-Validation. This procedure 
has also been tested by replacing 77(h) in (58) with 
77(h) (see (18)). 

(12) RNN+CV: A FIR model of order T estimated by 
iteratively solving 

h = arg min ||V — $h||| + A* ||W| 77(h) WJL 

(59) 

The weight matrices Wj and W r are updated at each 
iteration according to the procedure suggested in 
[37]. A* is selected through Cross-Validation. The 
case in which 77(h) in (59) is replaced with 77(h) 
has also been tested. 

(13) SH: The estimator returned by Algorithm 1 with 
Ks. v specified through the TC kernel. Q 


8 The MATLAB code is available upon request to the au¬ 
thors. 


Some implementation details follow. For SS, SH, 
NN+CV and RNN+CV, the length T of the estimated 
impulse response h has been set to 80 for scenario SI, 
to 50 for S2 and S3. 

The regularization parameter A in N2SID [55] has 
been chosen within a set of 20 elements logarithmically 
spaced between 10 -3 and 10 _1 for SI and 40 elements 
logarithmically spaced between 10 -3 and 10 5 for S2 and 
S3. The endpoints of these grids have been selected so 
that the estimated value of A is inside the interval. 

When observed that using cross-validation, the results 
were unreliable for the “few” data scenarios TV, !. To op¬ 
timize the performance, in scenarios S2 and S3 we have 
used two-thirds of the available data as training set and 
the remaining one third for the validation step. Instead, 
in scenario SI, the available data have been equally split 
into the training and the validation set. The regular¬ 
ization parameter A* has been selected from the vector 
v = Nt v , where N tra in is the length of the training 
dataset, while v is a vector of 25 elements logarithmi¬ 
cally spaced between 10 2 and 10 7 for SI, between 10 3 
and 10 7 for S2 and S3. 

7.3 Impulse Response Estimate 

To evaluate the estimators described above, we first in¬ 
troduce the so-called coefficient of determination (COD) 
between time series a and b: 


cod7v c (a, b) = 100 


V \ EtMk)-a ) 2 ) 


(60) 


where a = j^-Ea^i 0 ^)- The impulse response fit is 
measured using the average COD: 

^ 1 P m 

Ev c (h)=—xm cod Nc (hij,hij) ( 61 ) 

p i= l j =i 


where hij and hij denote the true and estimated impulse 
responses from input j to output i. We set N c = 1000, 
letting hij(k) =0, k = T + 1,..., 1000. 

Figures 1, 2 and 3 report the boxplots of (61) in the 
three scenarios by some of the identification techniques 
listed above. In particular, among the methods equipped 
with the oracle for model complexity selection, only the 
results of PEM+Or are shown, since it gives the best 
performance. As far as the subspace techniques are con¬ 
cerned, we only report N4SID(OE), because it generally 
performs slightly better than N4SID; analogously, only 
the results achieved by the routine PEM are illustrated, 
since the performance of PEM(OE) is worse. 

SH and RNN+CV achieve, among the procedures which 
can be practically implemented, the best performance in 
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Fig. 1. Scenario SI - Fit (61) obtained with different data 
lengths: 1 Vi,i = 200 (a), Ah ,2 = 500 (b) and Ah ,3 = 1000 (c). 

scenarios S2 and S3; instead, in scenario SI, RNN+CV 
has severe difficulties. It is also interesting to observe 
that the reweighted procedure in (59) (RNN+CV) 
improves the performance achieved by simple nuclear 
norm regularization (NN+CV) in all the scenarios ex¬ 
cept for SI. The results achieved imposing the nuclear 
norm penalty on the weighted Hankel matrix ’H(h) are 
not reported since they are in general slightly worse 
than those achieved by NN+CV and RNN+CV. 


Fig. 2. Scenario S2 - Fit (61) obtained with different data 
lengths: Afo.i = 350 (a), Ah ,2 = 500 (b) and Ah ,3 = 1000 (c). 

estimation datasets consisting of N — 500 data have 
been generated in this way. A set of validation data 
V v Nv = {u v (t), y v {t)\ t = 1,..., N v } was used to evaluate 
the COD for each system output, i.e. codw„ (Vi , Vi), i = 
1 , (see definition in (60)) with if denoting the one- 
step ahead predictor for the i-th output channel. Table 
1 compares the median, the 5th and the 95th percentiles 
of cod]y v (i achieved by the considered identifica¬ 
tion methods. 


7-4 Predictive performance 


7.5 Analysis of estimated Hankel singular values 


We compare the predictive performance of the meth¬ 
ods fisted in Section 7.2 over a specifically designed sce¬ 
nario. Namely, system (57) has been simulated with 
a unit variance white Gaussian noise input, while its 
output was corrupted by additive white Gaussian noise 
with a variance chosen in order to have SNR = 2. 200 


Figures 4, 5 and 6 are concerned with the ability in es¬ 
timating the Hankel singular values, which are grouped 
in the so called “signal singular values” (corresponding 
to the nonzero singular values of the true system) and 
“noise singular values” (corresponding to the zero singu¬ 
lar vaues of the the true system). Indeed, the top plot in 
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Table 1 

Modified Scenario SI - codjv„(yl\ j/i), N v = 500 (see (60)): median, 5th and 95th percentiles over 200 Monte-Carlo runs. 
Estimators are computed using 500 data (the best values among the realistic methods are highlighted in bold). 



md 

co<hv„( 2 /i> 2 /i) 

5th pctl 95th pctl 

md 

codiv 

5th pctl 95th pctl 

md 

codjv„ ( 2 / 3 , 3 / 3 ) 

5th pctl 95th pctl 

PEM+Or 

92.54 

87.69 

95.94 

92.76 

88.77 

96.14 

92.74 

88.06 

95.86 

SH 

91.48 

86.85 

95.03 

91.55 

86.60 

95.31 

91.46 

86.80 

94.83 

RNN+CV 

71.27 

65.35 

76.38 

69.94 

64.48 

74.83 

72.35 

65.44 

81.98 

NN+CV 

72.18 

66.44 

76.81 

69.38 

63.94 

74.29 

84.17 

78.80 

89.76 

PEM 

85.75 

59.86 

92.46 

86.12 

65.15 

92.84 

83.65 

52.76 

90.63 

N4SID(OE) 

82.42 

70.05 

89.71 

81.85 

66.69 

90.22 

88.36 

80.80 

92.39 

SS 

80.14 

76.19 

84.02 

80.06 

75.77 

83.16 

82.04 

76.43 

85.96 

N2SID 

34.78 

11.85 

51.04 

26.59 

7.57 

43.34 

58.95 

49.26 

65.79 
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Fig. 3. Scenario S3 - Fit (61) obtained with different data 
lengths: iV 3? i = 600 (a), iV 3) 2 = 800 (b) and iV 3)3 = 1000 (c). 


each figure shows the boxplots of the error on the “signal 


singular values”: 


n 

A signa i(h) := ^ |s»(h) - Sj(h)| (62) 

i= 1 

where h is the true impulse response vector, h is the esti¬ 
mated one, Sj(h) is the i-th normalized Hankel singular 
value and n here denotes the true system order. Simi¬ 
larly, the bottom plot contains the boxplots of the error 
on the “noise singular values”: 


pr p r 

A noise (h) := l«i( h ) - s*(h)| = ^ s»(h) (63) 

i=n -\-1 i=n -\-1 

Figure 4 shows that the poor performance observed in 
Figure 1 for NN+CV and RNN+CV is determined by 
the failure in detecting the “true” system complexity (as 
proven by the large error in the estimation of the “noise” 
singular values which can be interpreted as overestima¬ 
tion of the system order). On the other hand, the unsat¬ 
isfying performance of N2SID in Figure 1 is due to the 
under-estimation of the system complexity, which leads 
to a large bias in the estimation of the true Hankel singu¬ 
lar values (top of Figure 4) and to the correct detection 
of the “noise” subspace. Among the feasible methods, 
SH seems to correctly estimate the system complexity 
in most cases. 

With regards to scenarios S2 and S3, the joint analysis of 
Figures 2, 5 and 3, 6 reveals how the good performance 
in terms of impulse response fit achieved by PEM+Or 
and RNN+CV are mainly due to the correct reconstruc¬ 
tion of the “noise” subspace; indeed, the performance 
of SH in terms of fit are slightly worse even if it better 
recovers the “signal” subspace. A deeper inspection has 
revealed that the system complexity is underestimated 
by PEM+Or, RNN+CV and N2SID, thus explaining the 
almost perfect reconstruction of the “noise” subspace 
and the bias which affects the estimates of the “signal” 
subspace. This observation suggests that the good per¬ 
formance observed for RNN+CV in Figures 2 and 3 are 
favored by the nature of the systems in scenarios S2 and 
S3: indeed, underestimation of the system order does not 
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PEM SH RNN NN PEM N4SID SS N2SID 
+ Or + CV + CV (OE) 



PEM SH RNN NN PEM N4SID SS N2SID 
+ Or + CV + CV (OE) 

Fig. 4. Scenario SI - Top: Sum of absolute errors on the “sig¬ 
nal” normalized Hankel singular values (see (62)). Bottom: 
Sum of absolute errors on the “noise” normalized Hankel sin¬ 
gular values (see (63)). Considered data length: Nip = 500. 



Fig. 5. Scenario S2 - Top: Sum of absolute errors on the “sig¬ 
nal” normalized Hankel singular values (see (62)). Bottom: 
Sum of absolute errors on the “noise” normalized Hankel sin¬ 
gular values (see (63)). Considered data length: A/ 2,2 = 500. 



PEM SH RNN NN PEM N4SID SS N2SID 
+ Or + CV + CV (OE) 


Fig. 6. Scenario S3 - Top: Sum of absolute errors on the “sig¬ 
nal” normalized Hankel singular values (see (62)). Bottom: 
Sum of absolute errors on the “noise” normalized Hankel sin¬ 
gular values (see (63)). Considered data length: A 3,2 = 800. 

have a detrimental effect in these scenarios where there 
are many “small” Hankel singular values. 

Comparing the performance of NN+CV and RNN+CV 
in Figures 5 and 6 , it is clear that the reweighted proce¬ 
dure significantly increases the degree of sparsity in the 
estimated Hankel singular values. 

7 .6 Computational time 


A comparison of the methods listed in Section 7.2 is now 
done in terms of computational time. All algorithms were 
run on a server with two quad core Intel Xeon E5450 
processor at 3.00 GHz, 12 MB cache and 16 GB of RAM 
under MATLAB2014b. 

Table 2 reports the median, the 5th and 95th percentiles 
of the computational time over the 200 systems of sce¬ 
narios SI, S2 and S3, showing a clear gap in the perfor¬ 
mance of off-the-shelf methods (PEM, N4SID and SS) 
and non-off-the-shelf ones (SH, NN, RNN and N2SID); 
among the latters, our algorithm appears to be the least 
demanding one. 


In Section 6 a tailored Scaled Gradient Projection (SGP) 
method has been illustrated to solve the Marginal Like¬ 
lihood maximization problem at step 14 of Algorithm 1 
(see also (38)). To assess the benefits of SGP, we com¬ 
pare two implementations of Algorithm 1 which solve 
the above-mentioned optimization problem using, re¬ 
spectively, the MATLAB routine fmincon and the SGP 
Algorithm 2. In Table 3 execution times are reported for 
the three scenarios described in Section 7.1. ) 

The routine fmincon uses the interior-point algorithm 
and the default parameters setting (similar performance 
have been obtained through other algorithms, such as 
SQP or trust-region-reflective). The parameters involved 
in the SGP algorithm are set as follows: v = 10 —4 , 7 = 
0.4, OZ-min 10 , CXrnax 10 7 A min 10 ; A max 

10 10 . The following stopping criterion is adopted: 


/(A (fe) ) - /(A (fe+1) ) < l(T 9 |/(A (fe+1) )| 


For both the algorithms the maximum number of itera¬ 
tions has been fixed to 5000. 


8 Conclusion 

Casting linear system identification into the Bayesian es¬ 
timation framework, we have proposed a new Gaussian 
prior which has been derived using Maximum Entropy 
arguments under stability and complexity (McMillan de¬ 
gree) constraints. In particular, the part of the prior ac¬ 
counting for complexity controls the rank of the block 
Hankel matrix built with the Markov coefficients by in¬ 
ducing sparsity on the Hankel singular values; this, in 
turn, favours the estimated impulse response to lie on 
what we call “signal” subspace, i.e. the subspace spanned 
by the singular vectors corresponding to the “non-zero” 
Hankel singular values. 

We have designed an algorithm which iteratively refines 
the impulse response estimate by updating the hyper¬ 
parameters that define the prior and, in turn, by refining 
the estimate of the so-called “signal” subspace. At each 
iteration, the main computational burden is given by the 
hyper-parameters update, which is performed through 
marginal likelihood maximization. To reduce the com¬ 
putational effort required by this step, a suitably de- 
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Table 2 

Computational time (in sec) required to estimate a system: median, 5th and 95th percentiles over 200 Monte-Carlo runs. 
Estimators are computed using N . t 3 = 1000 data (best values among the realistic methods are highlighted in bold). 



md 

SI 

5th pctl 

95th pctl 

md 

S2 

5th pctl 

95th pctl 

md 

S3 

5th pctl 

95th pctl 

SH 

84.89 

43.70 

175.24 

67.22 

37.49 

548.89 

276.62 

129.07 

775.38 

RNN+CV 

418.93 

206.28 

1287.55 

95.87 

68.84 

584.58 

285.70 

196.60 

615.72 

NN+CV 

63.72 

58.29 

69.50 

49.51 

39.46 

206.27 

132.90 

110.32 

193.97 

PEM 

3.12 

2.44 

4.72 

1.60 

0.70 

12.89 

11.47 

1.01 

31.95 

N4SID(OE) 

1.54 

1.48 

1.67 

1.46 

0.96 

8.61 

7.99 

1.82 

36.34 

SS 

1.64 

1.47 

1.84 

10.51 

8.86 

13.23 

31.33 

25.58 

44.09 

N2SID 

666.74 

508.86 

851.72 

576.04 

462.73 

732.81 

504.84 

402.18 

764.83 


Table 3 

Computational time (in sec) required to estimate a system: median, 5th and 95th percentiles over 200 Monte-Carlo runs. 
Estimators are computed using N . : 3 = 1000 data. 




SI 



S2 



S3 



md 

5th pctl 

95th pctl 

md 

5th pctl 

95th pctl 

md 

5th pctl 

95th pctl 

fmincon 

1358.30 

853.80 

1893.10 

2545.10 

1322.80 

4816.80 

6651.60 

2951.60 

12732.00 

SGP 

84.89 

43.70 

175.24 

67.22 

37.49 

548.89 

276.62 

129.07 

775.38 


signed Scaled Gradient Projection (SGP) algorithm has 
been adopted. Simulations have highlighted the signifi¬ 
cant improvement achieved in terms of execution time 
of SGP w.r.t. standard off-the-shelf routines. 


The numerical comparison illustrated in this paper 
highlights some advantages of the proposed identifi¬ 
cation algorithm over state-of-the art routines. First, 
when MIMO systems have to be identified, our Hankel- 
based method appears more effective than the original 
regularization/Bayesian approach relying only on the 
“Stable-spline” kernel. Second, when compared with 
other methods which include a Hankel-type penalty 
(such as the the Reweighted Nuclear Norm (RNN)), it 
provides comparable performance on randomly gener¬ 
ated “large” MIMO systems, while it appears preferable 
on a fourth order “mildly-resonant” system. Third, 
with respect to more classical approaches, such as PEM 
and subspace algorithms (N4SID), our method provides 
more accurate estimates, especially in presence of a 
small identification dataset. 


The analysis of the estimated Hankel singular values has 
revealed how the final model estimates produced by the 
proposed algorithm are close to being of “low order”. 
Thus, future work will include the design and analysis of 
tailored model reduction techniques (preliminary work 
can be found in [47]). Furthermore, we plan to design 
a more efficient numerical implementation of our algo¬ 
rithm, as well as to extend its application and its com¬ 
parison with the other routines to the identification of 
ARMAX models. Finally, a deeper statistical analysis of 
our approach deserves to be conducted. 


A Connection with Nuclear Norm minimiza¬ 
tion approaches 

As observed in (34), through a special choice of the 
hyper-parameters £ in (33), kernel (32) induces a nuclear 
norm penalty on the (squared) Hankel matrix. Previ¬ 
ous works in the system identification literature have 
considered this kind of regularization, starting from the 
seminal work [23], where the nuclear norm heuristic 
was proposed for minimal order system approximation. 
In the context of subspace-type algorithms, [34] have 
replaced the SVD step of suitable “data matrices” with 
a nuclear norm penalty. This approach has then been 
extended to the case of missing input and output data 
[33] or to short data records [55]. Other variations of the 
method include a nuclear norm weighting [27] or a nu¬ 
clear norm minimization algorithm based on reweight¬ 
ing [37]. In [26] similar approaches have been proposed 
for handling missing data scenarios. 

The approach we propose differs from those discussed 
above mainly for three reasons. First, a special weight¬ 
ing scheme, depending upon three hyper-parameters 
is proposed, which is robust against overfitting and 
reduces bias. Second, casting the nuclear norm mini¬ 
mization step into a Bayesian framework allows to use 
marginal likelihood approaches to estimate the hyper¬ 
parameters: while these techniques have been shown to 
be robust against noise [41], they also allow to com¬ 
bine the weighted nuclear norm penalty with other 
penalties (as we have done in (36)). Third, while the 
above-mentioned works adopt a nuclear norm penalty 
on the Hankel matrix, here the penalty is imposed 
on the squared Hankel matrix, thus leading to an £2 
penalty on the Hankel singular values. This is essential 
in order to derive a Gaussian prior, implying that the 
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marginal likelihood is available in closed form. This fa¬ 
cilitates using the marginal likelihood to estimate the 
hyper-parameters. However, we should stress that in 
our approach sparsity in the Hankel singular values is 
favoured by the weighting Q (£). 


B Connection with Iterative reweighted algo¬ 
rithms 


Algorithm 1 shares key properties with the so-called it¬ 
erative reweighted algorithms, proposed by [38] and [57]. 
Considering a rank minimization problem, the algorithm 
introduced in [38] adopts a weighted trace heuristic as 
a surrogate to the rank function and iteratively updates 
the weighting matrix by means of a closed form expres¬ 
sion depending on the current optimal point. The trace 
heuristic considered in [38] has a clear analogy to the 
penalty term (28), in which Q(() plays the role of a 
weighting matrix. Also the structure of the matrix Q(C) 
in (27) resembles that of the weighting matrix in [38]. 
Specifically, following the approach in [38], the weight¬ 
ing Q^ at iteration k would be 

Q (fc) = (H(h (fc - 1) )H T (h (fc - 1) ) + £ :/ pr ) 1 (B.l) 

= (USU T + eI ^ 1 


where S denotes the singular values matrix and e is the 
regularization factor introduced in order to avoid nu¬ 
merical issues in the matrix inversion operation. Instead, 
our choice is 





EM 


-1 


(B.2) 


The similarity between (B.l) and (B.2) is apparent with 
I/A 2 playing the role of the regularization parameter e 
( J-— J- ) UnUJ being a rescaled and truncated 

\ Ai A2/ 


anc 

_| \ Ai A; 

version of USU T 


This peculiar structure of the weighting matrix, which 
arises from the maximum-entropy derivation of the 
prior, acts as an hyper regularizer which helps prevent¬ 
ing overfitting; the hierarchical Bayesian model provides 
a natural framework based on which regularization can 
be tuned through the choice of Ai and A 2 (see line 14 of 
Algorithm 1). 

9 Note that, even though no such constrained has been in¬ 
troduced, Ai < A 2 , so that ( -A-J- ) >0. 

\ Ai A2/ 


The Bayesian framework we adopted also connects our 
algorithm to the non-separable reweighting scheme pro¬ 
posed in [57] for solving a Sparse Bayesian Learning 
(SBL) problem: the algorithm iteratively alternates the 
computation of the optimal estimate and the closed-form 
update of the hyper-parameters matrix, as the algorithm 
we propose. The main difference between the cited al¬ 
gorithms and ours lies in the special structure of the 
weighting Q(C), which makes the weighting K$h de¬ 
pendent on the hyper-parameter vector A = [Ao, Ai, A 2 ] 
and n in a way such that closed form expressions for its 
update are not available. 
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